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Many statistical applications require an estimate of a covariance 
matrix and/or its inverse. When the matrix dimension is large com- 
pared to the sample size, which happens frequently, the sample co- 
variance matrix is known to perform poorly and may suffer from 
ill-conditioning. There already exists an extensive literature concern- 
ing improved estimators in such situations. In the absence of fur- 
ther knowledge about the structure of the true covariance matrix, 
the most successful approach so far, arguably, has been shrinkage 
estimation. Shrinking the sample covariance matrix to a multiple of 
the identity, by taking a weighted average of the two, turns out to 
be equivalent to linearly shrinking the sample eigenvalues to their 
grand mean, while retaining the sample eigenvectors. Our paper ex- 
tends this approach by considering nonlinear transformations of the 
sample eigenvalues. We show how to construct an estimator that is 
asymptotically equivalent to an oracle estimator suggested in pre- 
vious work. As demonstrated in extensive Monte Carlo simulations, 
the resulting bona fide estimator can result in sizeable improvements 
over the sample covariance matrix and also over linear shrinkage. 

1. Introduction. Many statistical applications require an estimate of 
a covariance matrix and/or of its inverse when the matrix dimension, p, 
is large compared to the sample size, n. It is well known that in such situa- 
tions, the usual estimator — the sample covariance matrix — performs poorly. 
It tends to be far from the population covariance matrix and ill-conditioned. 
The goal then becomes to find estimators that outperform the sample co- 
variance matrix, both in finite samples and asymptotically. For the purposes 
of asymptotic analyses, to reflect the fact that p is large compared to n, one 



Received October 2010; revised December 2011. 

1 Supported by the NCCR Finrisk project "New Methods in Theoretical and Empirical 
Asset Pricing." 

AMS 2000 subject classifications. Primary 62H12; secondary 62G20, 15A52. 
Key words and phrases. Large-dimensional asymptotics, nonlinear shrinkage, rotation 
equi variance. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2012, Vol. 40, No. 2, 1024-1060. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



O. LEDOIT AND M. WOLF 



has to employ large-dimensional asymptotics where p is allowed to go to 
infinity together with n. In contrast, standard asymptotics would assume 
that p remains fixed while n tends to infinity. 

One way to come up with improved estimators is to incorporate additional 
knowledge in the estimation process, such as sparseness, a graph model or 
a factor model; for example, see Bickel and Levina (2008), Rohde and Tsy- 
bakov (2011), Cai and Zhou (2012), Ravikumar et al. (2008), Rajaratnam, 
Massam and Carvalho (2008), Khare and Rajaratnam (2011), Fan, Fan and 
Lv (2008) and the references therein. 

However, not always is such additional knowledge available or trustwor- 
thy. In this general case, it is reasonable to require that covariance matrix 
estimators be rotation-equivariant. This means that rotating the data by 
some orthogonal matrix rotates the estimator in exactly the same way. In 
terms of the well-known decomposition of a matrix into eigenvectors and 
eigenvalues, an estimator is rotation-equivariant if and only if it has the 
same eigenvectors as the sample covariance matrix. Therefore, it can only 
differentiate itself by its eigenvalues. 

Ledoit and Wolf (2004) demonstrate that the largest sample eigenvalues 
are systematically biased upwards, and the smallest ones downwards. It is 
advantageous to correct this bias by pulling down the largest eigenvalues 
and pushing up the smallest ones, toward the grand mean of all sample 
eigenvalues. This is an application of the general shrinkage principle, going 
back to Stein (1956). Working under large-dimensional asymptotics, Ledoit 
and Wolf (2004) derive the optimal linear shrinkage formula (when the loss 
is defined as the Frobenius norm of the difference between the estimator and 
the true covariance matrix). The same shrinkage intensity is applied to all 
sample eigenvalues, regardless of their positions. For example, if the linear 
shrinkage intensity is 0.5, then every sample eigenvalue is moved half-way 
toward the grand mean of all sample eigenvalues. Ledoit and Wolf (2004) 
both derive asymptotic optimality properties of the resulting estimator of 
the covariance matrix and demonstrate that it has desirable finite-sample 
properties via simulation studies. 

A cursory glance at the Marcenko and Pastur (1967) equation, which 
governs the relationship between sample and population eigenvalues under 
large-dimensional asymptotics, shows that linear shrinkage is the first-order 
approximation to a fundamentally nonlinear problem. How good is this ap- 
proximation? Ledoit and Wolf (2004) are very clear about this. Depending 
on the situation at hand, the improvement over the sample covariance ma- 
trix can either be gigantic or minuscule. When p/n is large, and/or the 
population eigenvalues are close to one another, linear shrinkage captures 
most of the potential improvement over the sample covariance matrix. In 
the opposite case, that is, when p/n is small and/or the population eigen- 
values are dispersed, linear shrinkage hardly improves at all over the sample 
covariance matrix. 
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The intuition behind the present paper is that the first-order approxima- 
tion does not deliver a sufficient improvement when higher-order effects are 
too pronounced. The cure is to upgrade to nonlinear shrinkage estimation 
of the covariance matrix. We get away from the one-size-fits-all approach by 
applying an individualized shrinkage intensity to every sample eigenvalue. 
This is more challenging mathematically than linear shrinkage because many 
more parameters need to be estimated, but it is worth the extra effort. Such 
an estimator has the potential to asymptotically at least match the linear 
shrinkage estimator of Ledoit and Wolf (2004) and often do a lot better, 
especially when linear shrinkage does not deliver a sufficient improvement 
over the sample covariance matrix. As will be shown later in the paper, 
this is indeed what we achieve here. By providing substantial improvement 
over the sample covariance matrix throughout the entire parameter space, 
instead of just part of it, the nonlinear shrinkage estimator is as much of 
a step forward relative to linear shrinkage as linear shrinkage was relative to 
the sample covariance matrix. In terms of finite-sample performance, the lin- 
ear shrinkage estimator rarely performs better than the nonlinear shrinkage 
estimator. This happens only when the linear shrinkage estimator is (nearly) 
optimal already. However, as we show in simulations, the outperformance 
over the nonlinear shrinkage estimator is very small in such cases. Most of 
the time, the linear shrinkage estimator is far from optimal, and nonlinear 
shrinkage then offers a considerable amount of finite-sample improvement. 

A formula for nonlinear shrinkage intensities has recently been proposed 
by Ledoit and Peche (2011). It is motivated by a large-dimensional asymp- 
totic approximation to the optimal finite-sample rotation-equivariant shrink- 
age formula under the Frobenius norm. The advantage of the formula of 
Ledoit and Peche (2011) is that it does not depend on the unobservable 
population covariance matrix: it only depends on the distribution of sam- 
ple eigenvalues. The disadvantage is that the resulting covariance matrix 
estimator is an oracle estimator in that it depends on the "limiting" distri- 
bution of sample eigenvalues, not the observed one. These two objects are 
very different. Most critically, the limiting empirical cumulative distribution 
function (c.d.f.) of sample eigenvalues is continuously differentiable, whereas 
the observed one is, by construction, a step function. 

The main contribution of the present paper is to obtain a bona fide estima- 
tor of the covariance matrix that is asymptotically as good as the oracle esti- 
mator. This is done by consistently estimating the oracle nonlinear shrinkage 
intensities of Ledoit and Peche (2011), in a uniform sense. As a by-product, 
we also derive a new estimator of the limiting empirical c.d.f. of population 
eigenvalues. A previous such estimator was proposed by El Karoui (2008). 

Extensive Monte Carlo simulations indicate that our covariance matrix 
estimator improves substantially over the sample covariance matrix, even 
for matrix dimensions as low as p = 30. As expected, in some situations 
the nonlinear shrinkage estimator performs as well as Ledoit and Wolf's 
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(2004) linear shrinkage estimator, while in others, where higher-order effects 
are more pronounced, it does substantially better. Since the magnitude of 
higher-order effects depends on the population covariance matrix, which is 
unobservable, it is always safer a priori to use nonlinear shrinkage. 

Many statistical applications require an estimate of the precision matrix, 
which is the inverse of the covariance matrix, instead of (or in addition to) 
an estimate of the covariance matrix itself. Of course, one possibility is to 
simply take the inverse of the nonlinear shrinkage estimate of the covariance 
matrix itself. However, this would be ad hoc. The superior approach is to 
estimate the inverse covariance matrix directly by nonlinearly shrinking the 
inverses of the sample eigenvalues. This gives quite different and markedly 
better results. We provide a detailed, in-depth solution for this important 
problem as well. 

The remainder of the paper is organized as follows. Section 2 defines our 
framework for large-dimensional asymptotics and reviews some fundamen- 
tal results from the corresponding literature. Section 3 presents the oracle 
shrinkage estimator that motivates our bona fide nonlinear shrinkage esti- 
mator. Sections 4 and 5 show that the bona fide estimator is consistent for 
the oracle estimator. Section 6 examines finite-sample behavior via Monte 
Carlo simulations. Finally, Section 7 concludes. All mathematical proofs are 
collected in the supplement [Ledoit and Wolf (2012)]. 

2. Large-dimensional asymptotics. 

2.1. Basic framework. Let n denote the sample size and p = p(n) the 
number of variables, with p/n — > c € (0, 1) as n — > oo. This framework is 
known as large-dimensional asymptotics. The restriction to the case c < 1 
that we make here somewhat simplifies certain mathematical results as well 
as the implementation of our routines in software. The case c> 1, where the 
sample covariance matrix is singular, could be handled by similar methods, 
but is left to future research. 

The following set of assumptions will be maintained throughout the paper. 

(Al) The population covariance matrix T, n is a nonrandom p-dimensional 
positive definite matrix. 

(A2) Let X n be an n x p matrix of real independent and identically dis- 
tributed (i.i.d.) random variables with zero mean and unit variance. 

One only observes Y n = X n T,l/ 2 , so neither X n nor S n are observed 
on their own. 

(A3) Let ((r ni i, . . . , r n ,, p ); (u n ,i, • • • > v n,p)) denote a system of eigenvalues and 
eigenvectors of S n . The empirical distribution function (e.d.f.) of the 
population eigenvalues is defined as Vi G M, H n (t) =p~ l Yli=i %r„ l5 +oo)(0> 
where 1 denotes the indicator function of a set. We assume H n (t) con- 
verges to some limit H (t) at all points of continuity of H. 
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(A4) Supp(-ff), the support of H, is the union of a finite number of closed 
intervals, bounded away from zero and infinity. Furthermore, there 
exists a compact interval in (0, +00) that contains Supp(-ff n ) for all n 
large enough. 

Let ((A ni i, . . . , A n)P ); (u n> i, . . . , u n>p )) denote a system of eigenvalues and 

eigenvectors of the sample covariance matrix S n = n~ l Y^Y n = n~ l T^I 2 X' n X n : 
1/2 

S n • We can assume that the eigenvalues are sorted in increasing order 
without loss of generality (w.l.o.g.). The first subscript, n, will be omitted 
when no confusion is possible. The e.d.f. of the sample eigenvalues is defined 
as VA € R,F n (X) ^p~ l £f =1 l[A i ,+oc)(A). 

In the remainder of the paper, we shall use the notation Re(z) and Im(z) 
for the real and imaginary parts, respectively, of a complex number z, so 
that 

VzeC z = Re(z)+i-Im(z). 

The Stieltjes transform of a nondecreasing function G is defined by 

r+00 1 

(2.1) VzeC+ m G {z)= / dG(X), 



CO 



X-Z 



where C + is the half-plane of complex numbers with strictly positive imag- 
inary part. The Stieltjes transform has a well-known inversion formula, 

G(b)-G(a)= lim - I im[m G (£ + mj)1 df, 

which holds if G is continuous at a and b. Thus, the Stieltjes transform of 
the e.d.f. of sample eigenvalues is 

1 P 1 1 

Vz G C+ m Fn (z) = - V = -Tr[(S n - zl)' 1 }, 

V^i Xi-z p 

where / denotes a conformable identity matrix. 

2.2. Marcenko-Pastur equation and reformulations. Marcenko and Pas- 
tur (1967) and others have proven that F n (X) converges almost surely (a.s.) 
to some nonrandom limit F(X) at all points of continuity of F under certain 
sets of assumptions. Furthermore, Marcenko and Pastur discovered the equa- 
tion that relates mj? to H . The most convenient expression of the Marcenko- 
Pastur equation is the one found in Silverstein [(1995), equation (1.4)], 



(2.2) Vz€C+ m F (z)= / -=- — dH(r). 

J -00 T[l-c-czm F (z)\- z 

This version of the Marcenko-Pastur equation is the one that we start out 
with. In addition, Silverstein and Choi (1995) showed that 

VA€K-{0} lim m F (z) = m F (X) 
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exists, and that F has a continuous derivative F' = 7r~ 1 Im[m^] on all of R 
with F' = on (— oo,0]. For purposes that will become apparent later, it is 
useful to reformulate the Marcenko-Pastur equation. 

The limiting e.d.f. of the eigenvalues of n~ 1 Y^ l Y n = n -1 !^ 5 '' X^Xn^j/ 2 
was defined as F. In addition, define the limiting e.d.f. of the eigenvalues of 
•nT x Y n Yj l = n^Xn^nX^ as F. It then holds 

Vx£R F(x) = {l-c)t [0i+oo) (x) + cF(x), 

Vxeffi F(x) = —l [0i+oo) (x) + -F{x), 
c 1 ' c 

c — 1 

Vz £ C + rriF_{z) = \-cmp(z), 

1 — c 1 

Vz € C + mpiz) = 1 — rnp(z). 

cz c — 

With this notation, equation (1.3) of Silverstein and Choi (1995) rewrites 
the Marcenko-Pastur equation in the following way: for each z 6 C + , mp{z) 
is the unique solution in C + to the equation 

+oo _ 1-1 



(2.3) mF_{z) 



1 + T1TlF_(z) 



dH(r) 



Now introduce uf(z) = —l/mp^z). Notice that uf{z) £ C + <^=^» mp{z) € 
C + . The mapping from uf{z) to itif(z) is one-to-one on C + . 

With this change of variable, equation (2.3) is equivalent to saying that 
for each z E C + , U£_(z) is the unique solution in C + to the equation 

f + OO 

(2.4) u F {z) = z + cu F {z) —dH(r). 

- ~ J-oo T-Uf(z) 

Let the linear operator L transform any e.d.f. G into 

LG(x)= I tcLG{t). 

J — oo 

Combining L with the Stieltjes transform, we get 

r+oo T 

mLG(z)= cLG(t) = 1 + zm G {z). 

J-oo T-Z 

Thus, we can rewrite equation (2.4) more concisely as 

(2.5) uf_(z) = Z + CUF_{z)mLH(uF{z)). 

As Silverstein and Choi [(1995), equation (1.4)] explain, the function defined 
in equation (2.3) is invertible. Thus we can define the inverse function 

1 f + °° T 

(2.6) z F (m) = +c — dH(r). 

~ m J-oo 1+rm 
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We can do the same thing for equation (2.5) and define the inverse function 

(2.7) zf_(u) = u — cumLH(u)- 

Equations (2.2), (2.3), (2.5), (2.6) and (2.7) are all completely equivalent to 
one another; solving any one of them means having solved them all. They 
are all just reformulations of the Marcenko-Pastur equation. 

As will be detailed in Section 3, the oracle nonlinear shrinkage estimator 
of S n involves the quantity mi?(A), for various inputs A. Section 2.3 describes 
how this quantity can be found in the hypothetical case that F and H are 
actually known. This will then allow us later to discuss consistent estimation 
of mi?(A) in the realistic case when F and H are unknown. 

2.3. Solving the Marcenko-Pastur equation. Silverstein and Choi (1995) 
explain how the support of F, denoted by Supp(i ? ), is determined. Let 
B = {ueR:u^0,ue Supp C (#)}. Then plot the function zp(u) of (2.7) on 
the set B. Find the extreme values on each interval. Delete these points and 
everything in between on the real line. Do this for all increasing intervals. 
What is left is just Supp(-F); see Figure 1 of Bai and Silverstein (1998) for 
an illustration. 

To simplify, we will assume from here on that Supp(-F) is a single com- 
pact interval, bounded away from zero, with F' > in the interior of this 
interval. But if Supp(-F) is the union of a finite number of such intervals, 
the arguments presented in this section as well as in the remainder of the 
paper apply separately to each interval. In particular, our consistency re- 
sults presented in subsequent sections can be easily extended to this more 
general case. On the other hand, the even more general case of Supp(-F) be- 
ing the union of an infinite number of such intervals or being a noncompact 
interval is ruled out by assumption (A4). By our assumption then, Supp(-F) 
is given by the compact interval \zf_{u\), 2f(u2)] for some u\ < U2- To keep 
the notation shorter in what follows, let z\ = zf_{u\) and J,i = 2^(2x2). 

We know that for every A in the interior of Supp(F), there exists a unique 
v E C + , denoted by v\ , such that 

(2.8) v x - cv x m LH (v x ) = A. 
We further know that 

F'(X) = -Z'(A) = — lm[m F {\)} = — Im 

C C7T C7T 

The converse is also true. Since Supp(F) = [zp(ui), Z£_(u2)], for every x € 
(^1,7/2), there exists a unique y > 0, denoted by y x , such that 




(x + iy x ) - c(x + iy x )m LH (x + iy x ) G R. 
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In other words, y x is the unique value of y > for which Im[(x + iy) — c(x + 
iy)rriLH( x + w)} = 0. Also, if A^ denotes the value of A for which we have 
(x + iy x ) - c(x + iy x )m LH (x + iy x ) = A, then, by definition, z Xx = x + iy x . 

Once we find a way to consistently estimate y x for any x S [1*1,1*2], then 
we have an estimate of the (asymptotic) solution to the Marcenko-Pastur 
equation. For example, Im[— l/(x + iy x )]/(c]r) is the value of the density F' 
evaluated at Re[(x + iy x ) — c(x + iy x )rriLH(x + iy x )] = (x + iy x ) — c(x + 
iyx)m LH (x + iy x ). 

From the above arguments, it follows that 

(2.9) VA€(2 r i,Z2) tof(A) = and so rhp (A) = — . 

~ v\ cA c v\ 

3. Oracle estimator. 



3.1. Covariance matrix. In the absence of specific information about the 
true covariance matrix E n , it appears reasonable to restrict attention to the 
class of estimators that are equivariant with respect to rotations of the 
observed data. To be more specific, let W be an arbitrary p-dimensional or- 
thogonal matrix. Let E n = T, n (Y n ) be an estimator of S n . Then the estimator 
is said to be rotation- equivariant if it satisfies T, n (Y n W) = W'"E n (Y n )W . In 
other words, the estimate based on the rotated data equals the rotation of 
the estimate based on the original data. The class of rotation-equivariant 
estimators of the covariance matrix is constituted of all the estimators that 
have the same eigenvectors as the sample covariance matrix; for example, 
see Perlman [(2007), Section 5.4]. Every rotation-equivariant estimator is 
thus of the form 

U n D n U' n where D n = Diag(di, . . . , d p ) is diagonal, 

and where U n is the matrix whose ith column is the sample eigenvector 
u i = Un,i- This is the class we consider. 

The starting objective is to find the matrix in this class that is closest 
to S n . To measure distance, we choose the Frobenius norm defined as 



(3.1) 11^411 = y/Tr (AA')/r for any matrix A of dimension r x m. 

[Dividing by the dimension of the square matrix AA' inside the root is not 
standard, but we do this for asymptotic purposes so that the Frobenius 
norm remains constant equal to one for the identity matrix regardless of the 
dimension; see Ledoit and Wolf (2004).] As a result, we end up with the 
following minimization problem: 

min||LT n D n LT^ - S n ||. 

Elementary matrix algebra shows that its solution is 

(3.2) £)* = Diag(d*, . . . , d*) where d* = u^S n t*j for i = 1, . . . ,p. 
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The interpretation of d* is that it captures how the ith sample eigenvector m 
relates to the population covariance matrix S n as a whole. As a result, the 
finite-sample optimal estimator is given by 

(3.3) S* = U n D* n U' n where D* is defined as in (3.2). 

By generalizing the Marcenko-Pastur equation (2.2), Ledoit and Peche 
(2011) show that d% can be approximated by the quantity 

(3.4) d? = - X \ forz=l,...,p, 

1 - c - c\iTn F (\i)\ z 

from which they deduce their oracle estimator 

(3.5) S° n r = U n D%U' n where D° r = Diag«, . . . , d° p r ). 

The key difference between D* and D° r is that the former depends on the 
unobservable population covariance matrix, whereas the latter depends on 
the limiting distribution of sample eigenvalues, which makes it amenable to 
estimation, as explained below. 

Note that constitutes a nonlinear shrinkage estimator: since the value 
of the denominator of df varies with Aj , the shrunken eigenvalues d° r are ob- 
tained by applying a nonlinear transformation to the sample eigenvalues Aj; 
see Figure 3 for an illustration. Ledoit and Peche (2011) also illustrate in 
some (limited) simulations that this oracle estimator can provide a magni- 
tude of improvement over the linear shrinkage estimator of Ledoit and Wolf 
(2004). 

3.2. Precision matrix. Often times an estimator of the inverse of the 
covariance matrix, or the precision matrix, S" 1 is required. A reasonable 
strategy would be to first estimate E n , and to then simply take the inverse 
of the resulting estimator. However, such a strategy will generally not be 
optimal. 

By arguments analogous to those leading up to (3.3), among the class of 
rotation-equivariant estimators, the finite-sample optimal estimator of S" 1 
with respect to the Frobenius norm is given by 

(3.6) P* = U n A* n U' n where a* = ti^E" 1 ^ for i = 1, . . . ,p. 

In particular, note that P* ^ (S^) _1 in general. 

Studying the asymptotic behavior of the diagonal matrix A* n led Ledoit 
and Peche (2011) to the following oracle estimator: 

p or = TT A or TT' 

(3.7) 

where a° r = X i 1 (1 — c — 2cAjRe[mp(Ai)]) for i = 1, . . . ,p. 
In particular, note that P° r ^ (S^) -1 in general. 
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Remark 3.1. 

volve the unknown quantities m^(Aj), for i = 1, . . . ,p. As a result, they are 
not bona fide estimators. However, being able to consistently estimate m^(A), 
uniformly in A, will allow us to construct bona fide estimators S n and P n 
that converge to their respective oracle counterparts almost surely (in the 
sense that the Frobenius norm of the difference converges to zero almost 
surely). 

Section 4 explains how to construct a uniformly consistent estimator 
of mp(A) based on a consistent estimator of H , the limiting spectral distri- 
bution of the population eigenvalues. Section 5 discusses how to construct 
a consistent estimator of H from the data. 

3.3. Further details on the results of Ledoit and Peche (2011). Ledoit 
and Peche (2011) (hereafter LP) study functionals of the type 

N N 

V, € C+ @%(z) = - £ — z £ KV x g( Tj ) 

1=1 1 7 = 1 

(3.8) 

= -Tr[(S N -ziy 1 g(i; N )), 

where g is any real-valued univariate function satisfying suitable regular- 
ity conditions. Comparison with equation (2.1) reveals that this family of 
functionals generalizes the Stieltjes transform, with the Stieltjes transform 
corresponding to the special case g = l. What is of interest is what happens 
for other, nonconstant functions g. 

It turns out that it is possible to generalize the Marcenko-Pastur re- 
sult (2.2) to any function g with finitely many points of discontinuity. Un- 
der assumptions that are usual in the Random Matrix Theory literature, 
LP prove in their Theorem 2 that there exists a nonrandom function Q g 
defined over C + such that Q 9 N (z) converges a.s. to Q 9 (z) for all z E C + . 
Furthermore, s is given by 

(3.9) Vz € C + 9*(s) = / + °° -j 9{T) dH{r). 

J-oo t[1 - c - czm F (z)\ - z 

What is remarkable is that, as one moves from the constant function g = 1 
to any other function g(r), the integration kernel T ^i_ c _czm F (z)]-z remams 
unchanged. Therefore equation (3.9) is a direct generalization of Marcenko 
and Pastur's foundational result. 

The power and usefulness of this generalization become apparent once one 
starts plugging specific, judiciously chosen functions g{r) into equation (3.9). 
For the purpose of illustration, LP work out three examples of functions g(r). 

The first example of LP is g(r) = l(_oo,r)j where 1 denotes the indicator 
function of a set. It enables them to characterize the asymptotic location of 
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sample eigenvectors relative to population eigenvectors. Since this result is 
not directly relevant to the present paper, we will not elaborate further, and 
refer the interested reader to LP's Section 1.2. 

The second example of LP is g(r) = t. It enables them to characterize 
the asymptotic behavior of the quantities d° r introduced in equation (3.4). 
More formally, for any ti£ (0, 1), define 

[u-p\ [u-p\ 

(3.10) A;(«) = -£< and A^(u) = - £ d?\ 

p ^-^ p ^-^ 

F i=l y i=l 

where ['J denotes the integer part. LP's Theorem 4 proves that A*(u) — 
A»^0a.s. 

The third example of LP is g(r) = 1/r. It enables them to characterize 
the asymptotic behavior of the quantities a° r introduced in equation (3.7). 
For any m£ (0, 1) define 

(3.11) = and «£■(«) = tf\ 

LP's Theorem 5 proves that \[/* (it) — *£, r (it) — > a.s. 

4. Estimation of mjr(A). Fix x G [iti + 77, tt2 — 77], where 77 > is some 
small number. From the previous discussion in Section 2, it follows that the 
equation 

Im[x + iy — c(x + iy)rriLH(x + iy)] = 

has a unique solution y G (0, +00), called y x . Since u\ < x < 112, it follows 
that y x > 0; for x = u\ or x = U2, we would have y x = instead. The goal is 
to consistently estimate y x , uniformly in x G [u\ + 77, Ui — rj]. 
Define for any c.d.f. G and for any d > 0, the real function 

9gAv^ x ) = \Im[x + iy - d(x + iy)m LG (x + iy)}\. 
With this notation, y x is the unique minimizer in (0, +00) of gn.c(y, x ) then. 
In particular, gH,c{Vxi x ) = °- 

In the remainder of the paper, the symbol =>■ denotes weak convergence 
(or convergence in distribution). 

Proposition 4.1. (i) Let {H n } be a sequence of probability measures 
with H n => H. Let {c n } be a sequence of positive real numbers with c n — > c. 
Let K C (0, 00) be a compact interval satisfying {y x : x G [u\ +77, U2 — rj]} C K . 
For a given x G [u\ + 77, U2 — i]] , let y n>x = min y6 ^: gg ~ (y, x) . It then holds 
that y HtX — > y x uniformly in x G [u\ + 77, U2 — r)] . 

(ii) In case of H n =4> H a.s., it holds that y n<x — > y x a.s. uniformly in 
x G [u\ + 77, U2 — 77] . 
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It should be pointed out that the assumption {y x :x& [u\ + rj, U2 — Tj] } C K 
is not really restrictive, since one can choose K= [e,l/e], for e arbitrarily 
small. 

We also need to solve the "inverse" estimation problem, namely^starting 
with A and recovering the corresponding v\. Fix As \z\ + 5, Z2 — 6], where 
5 > is some small number. From the previous discussion, it follows that 
the equation 

v — cvrriLH{v) = A 

has a unique solution v G C + , called v\. The goal is to consistently esti- 
mate v\, uniformly in A € \z\ + 5, Z2 — 5]. 

Define for any c.d.f. G and for any d > 0, the real function 

hG,d(v,\) = \v- dvm LG (v) - Aj. 

With this notation, v\ is then the unique minimizer in C + of Hh,c(v, A). In 
particular, h,H jC (v\,X) = 0. 

Proposition 4.2. (i) Let {H n } be a sequence of probability measures 
with H n =>- H. Let {c n } be a sequence of positive real numbers with c n — > c. 
Let K C C + be a compact set satisfying {v\ : A G [z± + S,Z2 — 5}} C K . For 
a given A G \z\ + 6, Z2 — S] , let v n \ = vam. v cK ~ (v,X). It then holds that 

tin >Cn 

v n ,\ — > V\ uniformly in AG [Si + 6, Z2 — 6] . 

(ii) In case of H n =>■ H a.s., it holds that v nt \ — > v\ a.s. uniformly in 
A G [z! + 6, Z2-6]. 

Being able to find consistent estimators of v\, uniformly in A, now allows 
us to find consistent estimators of mp(A), uniformly in A, based on (2.9). 
Our estimator of mp(A) is given by 

(4.1) m F „ (A) = i=^-iJ-. 

Hn ' Cn C n X C n V njX 

This, in turn, provides us with a consistent estimator of S n r , the oracle 
nonlinear shrinkage estimator of E n . Define 

Sn — U n D n U n 

(4 ' 2) ~ A 

where * = rrr — ?xi^ — ioTi=1 >--->p- 

It also provides us with a consistent estimator of P° r , the oracle nonlinear 
shrinkage estimator of X" 1 . Define 

Pn = U n A n U n 

(4.3) 

where a; = A. 1 (1 -c n - 2c n XiRe[m F _ (Aj)]) for i = 1,.. . ,p. 

H n ,c n 
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In particular, note that P n ^ S n 1 in general. 
Proposition 4.3. 

(i) Let {H n } be a sequence of probability measures with H n =>■ H . Let {c n } 
be a sequence of positive real numbers with c n — > c. It then holds that: 

(a) rhp- „ (A) — >■ mir(A) uniformly in A € \z\ + 6,2:2 — 5]; 

(b) \\S n -S^\\^0; 

(c) ||P n -P-||^0. 

(ii) In case of H n H a.s., it holds that: 

(a) rhp- (A)— >?bf(A) uniformly in A € [Si + <5, S2 — 5] a.s.; 

(b) \\S n -S^\\^0 a.s.; 

(c) \\P n -P^\\^0 a.s. 

5. Estimation of H. As described before, consistent estimation of the 
oracle estimators of Ledoit and Peche (2011) requires (uniformly) consis- 
tent estimation of mp(A). Since Im[?n^(A)] = ttF'(X), one possible approach 
could be to take an off-the-shelf density estimator for F', based on the ob- 
served sample eigenvalues Aj. There exists a large literature on density es- 
timation; for example, see Silverman (1986). The real part of rn^(Aj) could 
be estimated in a similar manner. 

However, the sample eigenvalues do not satisfy any of the regularity con- 
ditions usually invoked for the underlying data. It really is not clear at all 
whether an off-the-shelf density estimator applied to the sample eigenvalues 
would result in consistent estimation of F' . 

Even if this issue was somehow resolved, using such a generic procedure 
would not exploit the specific features of the problem. Namely: F is not just 
any distribution; it is a distribution of sample eigenvalues. It is the solution 
to the Marcenko-Pastur equation for some H. This is valuable informa- 
tion that narrows down considerably the set of possible distributions F. 
Therefore an estimation procedure specifically designed to incorporate this 
a priori knowledge would be better suited to the problem at hand. This is 
the approach we select. 

In a nutshell: our estimator of F is the c.d.f. that is closest to F n among 
the c.d.f. 's that are a solution to the Marcenko-Pastur equation for some H 
and for c = c n =p/n. The "underlying" distribution H that produces the 
thus obtained estimator of F is, in turn, our estimator of H . If we can show 
that this estimator of H is consistent, then the results of the previous section 
demonstrate that the implied estimator of rhp^A) is uniformly consistent. 

Section 5.1 derives theoretical properties of this approach, while Sec- 
tion 5.2 discusses various issues concerning the practical implementation. 
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5.1. Consistency results. For a grid of real numbers Q = {. . . ,t_i, to, 
ii, . . .} C R, with tk-i < tk, define the corresponding grid size 7 as 

7 = sup(£ fc -ifc-i). 

A grid Q is said to cover a compact interval [a, 6] C R if there exists at least 
one tk € Q with t^ <a and at least another 6 Q with 6 < ty . A sequence 
of grids {Qn} is said to eventually cover a compact interval [a, b] if for 
every <f> > there exist A r = iV(0) such that Q n covers the compact interval 
[a + (f>,b-<f>] for all n>N. 

For any probability measure H on the real line and for any c > 0, let ~ 
denote the c.d.f. on the real line induced by the corresponding solution of 
the Marcenko-Pastur equation. More specifically, for each z € C + , mp~ _(z) 

is the unique solution for m £ C + to the equation 

/+°° 1 ~ 
-r 1 dH(T). 
.oo t[1 -c-czm\ -z 

In this notation, we then have F = Fjj jC . 

It follows from Silverstein and Choi (1995) again that 

VAeR-{0j lim m F ~ (z) = m F ^ (A) 

exists, and that F%~ has a continuous derivative F'~ _ = 7r Imfm??- 1 on 

(0, +00). In the case c < 1, i 7 ^ ~ has a continuous derivative on all of R with 

F'~ =0 on (-oo,0l. 

For a grid Q on the real line and for two c.d.f.'s G\ and G2, define 

||Gi-G 2 ||Q = sup|Gi(t)-G 2 (t)|. 
teQ 

The following theorem shows that both F and H can be estimated con- 
sistently via an idealized algorithm. 

Theorem 5.1. Let {Qn} be a sequence of grids on the real line eventu- 
ally covering the support of F with corresponding grid sizes {"fn} satisfying 
"y n —)• 0. Let {c n } be a sequence of positive real numbers with c n — > c. Let H n 
be defined as 

(5-1) H n = argmin \\Ffi~ n - F n \\ Qn , 

H 

where H is a probability measure. 

Then we have (i) F$ ~ =^F a.s.; and (ii) H n ^>H a.s. 

The algorithm used in the theorem is not practical for two reasons. First, 
it is not possible to optimize over all probability measures H. But simi- 
larly to El Karoui (2008), we can show that it is sufficient to optimize over 
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all probability measures that are sums of atoms, the location of which is 
restricted to a fixed-size grid, with the grid size vanishing asymptotically. 

Corollary 5.1. Let {Q n } be a sequence of grids on the real line even- 
tually covering the support of F with corresponding grid sizes {jn} satisfying 
j n — > 0. Let {c n } be a sequence of positive real numbers with c n — > c. Let V n 
denote the set of all probability measures that are sums of atoms belonging to 
the grid {J n /T n , (J n + 1)/T n , . . . ,K n /T n } with T n — >• oo, J n being the largest 
integer satisfying J n /T n < Ai, and K n being the smallest integer satisfying 

K n /T n > X p . Let H n be defined as 

(5-2) H n = argmin \\F Sta - F n \\ Qn . 

HeVn 

Then we have (i) Fg ~ => F a.s.; and (ii) H n =^H a.s. 

But even restricting the optimization over a manageable set of proba- 
bility measures is not quite practical jret for a second reason. Namely, to 
compute Fg~ exactly for a given H, one would have to (numerically) 
solve the Marcenko-Pastur equation for an infinite number of points. In 
practice, we can only afford to solve the equation for a finite number of 
points and then approximate Fs - by trapezoidal integration. Fortunately, 
this approximation does not negatively affect the consistency of our estima- 
tors. 

Let G be a c.d.f. with continuous density g and compact support [a, b] . For 
a grid Q = {. . . , £_i,£oj h, ■ ■ ■} covering the support of G, the approximation 
to G via trapezoidal integration over the grid Q, denoted by Gq, is obtained 
as follows. For t € [a, b], let Ji Q = max{£; : t^ < a} and J^i = mm{k : t < t^}. 
Then 

(5.3) G Q «t) = (tk+1 ~ ^Mtk)+g(t k+1 )} 

k=Ji 

Now turn to the special case G = Fg ~ and Q = Q n . In this case, we denote 
the approximation to Fg ~ via trapezoidal integration over the grid Q n by 

F~ ~ 

Corollary 5.2. Assume the same assumptions as in Corollary 5.1. 
Let H n be defined as 

(5.4) H n = argmin \\F S tn Qn - F n \\ Qn . 

Hev n 

Let rriF- _ (A), S n , and P n be defined as in (4-1), (4-%) an d (4-3), respec- 



tively. Then: 
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(i) Fs ~ a.s. 

(ii) H n ^ H a.s. 

(iii) For any 5 > 0, rhp- _ (A) — > mp(A) a.s. uniformly in A € \z\ + 

5, z 2 -5\. 

(iv) \\S n -SZ\\^0 a.s. 

(v) \\P n -P?\\-K} a.s. 

5.2. Implementation details. 

Decomposition of the c.d.f. of population eigenvalues. As discussed be- 
fore, it is not practical to search over the set of all possible c.d.f. 's H. 
Following El Karoui (2008), we project H onto a certain basis of c.d.f. 's 
(-^fc)ifc=i ... Ki where K goes to infinity along with n and p. The projection 
of H onto this basis is given by the nonnegative weights W\ , . . . , wk , where 

K K 

(5.5) ViGlR H(t)^H(t) = ^2w k M k (t) and ^w k = l. 

k=l k=l 

Thus, ou£ estimator for F will be a solution to the Marcenko-Pastur equa- 
tion for H given by equation (5.5) for some (w k )k=i,...,Ki an d for c = p/n. It 
is just a matter of searching over all sets of nonnegative weights summing 
up to one. 

Choice of basis. We base the c.d.f. 's {M k )k=i,...,K on a grid of p equally 
spaced points on the interval [Ai,A p ]. 

i — 1 

(5.6) Xi = Ai H — — (Ap-Ai) for i = 1, ...,p. 

Thus xi = Ai and x p = A p . We then form the basis {Mi, . . . ,M k } as the 
union of three families of c.d.f. 's: 

(1) the indicator functions l\ Xit +oo) (* = 1> • • • >p)l 

(2) the c.d.f. 's whose derivatives are linearly increasing on the interval [sj_x, Xi] 
and zero everywhere else (i = 2, . . . ,p); 

(3) the c.d.f. 's whose derivatives are linearly decreasing on the interval 
[xi_i, xi\ and zero everywhere else (i = 2, . . . ,p). 

This list yields a basis (Affc)fc = i of dimension K = 3p — 2. Notice that by 
the theoretical results of Section 5.1, it would be sufficient to use the first 
family only. Including the second and third families in addition cannot make 
the approximation to H any worse. 

Trapezoidal integration. For a given H = X/fe=i w k^ki it i s computa- 
tionally too expensive (in the context of an optimization procedure) to solve 
the Marcenko-Pastur equation for mp(z) over all z £ C + . It is more effi- 
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cient to solve the Marcenko-Pastur equation only for rhi?(xj) (i = 1, . . . ,p), 
and to use the trapezoidal approximation formula to deduce from it F{xi) 
(i = 1, . . . , p). The trapezoidal rule gives 

Vi = l, . . . , P F(xi) = J2 Xj '\/ J : /-V,; + ^^F'( Xl ) 

^ ^ = g (x j+1 -gj_i)Im[m F (a 3 -)] 

i=i 27F 

(gi ~ Xj_i)Im[rhF(xj)] 
+ 2vr 

with the convention xq = 0. 

Objective function. The objective function measures the distance Jje- 
tween F n and the F that solves the Marcenko-Pastur equation for H = 
Ylk=i w kMk and for c = p/n. Traditionally, F n is defined as cadlag, that is, 
F n (^i) = 1/p an d -^n(Ap) = 1. However, there is a certain degree of arbitrari- 
ness in this convention: why is F n (X p ) equal to one but F n (Ai) not equal 
to zero? By symmetry, there is no a priori justification for specifying that 
the largest eigenvalue is closer to the supremum of the support of F than 
the smallest to its infimum. Therefore, a different convention might be more 
appropriate in this case, which leads us to the following definition: 

(5.8) Vi = l,...,p F n {\) = --^. 

p 2p 

This choice restores a certain element of symmetry to the treatment^of the 
smallest vs. the largest eigenvalue. From equation (5.8), we deduce F n {xi), 
for i = 2, . . . ,p — 1, by linear interpolation. With a sup-norm error penalty, 
this leads to the following objective function: 

(5.9) max \F(xi) - F n (xi)\, 

i=l,...,p 

where F{xi) is given by equation (5.7) for i = 1, . . . ,p. Using equation (5.7), 
we can rewrite this objective function as 

i 1 

E {x j+ i- Xj-i)lm[m F (xj)) (xj - Xi-i)lm[m F (xi)] ~ 
2vr 2vr n[Xi) ' 



max 
i=l,...,p 



Optimization program. We now have all the ingredients needed to state 
the optimization program that will extract the estimator of mjr(a;i), . . . , 
m,F{xp) from the observations Ai, . . . , X p . It is the following: 

1 (xj + i - Xj-i)lm[mj] (xi - Xi-i)lm[mi 



mm max 

mi,..,m p i=l,...,j 
wi,...,w K 



2vr + 2vr tn[Xl) 
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subject to 

Vj = l, 



,P 



rria 



A 

E 

k=l 



t[l - [p/n] - {p/njxjmj] 



■dM k (t), 



(5.10) 



K 

k=l 



,p 

K 



■Wfc > 0. 

The key is to introduce the variables rrij = rhp(xj) 1 for j = 1, . . . ,p. The con- 
straint in equation (5.10) imposes that m, is the solution to the Marcenko- 
Pastur equation evaluated as z € C + — > when if = X)fc=i w kM k - 

Real optimization program. In practice, most optimizers only accept real 
variables. Therefore it is necessary to decompose rrij into its real and imag- 
inary parts: aj = Re[m,] and 6j = Imfm^]. Then we can optimize separately 
over the two sets of real variables a,j and bj for j = l,...,p. The Marcenko- 
Pastur constraint in equation (5.10) splits into two constraints: one for the 
real part and the other for the imaginary part. The reformulated optimiza- 
tion program is 

1 (xj+i - Xj-i)bj , (xi - Xi-i)bi 



(5.11) 

subject to 
Vj 

(5.12) 



mm max 

ai,...,a p i=l,...,p 
bi,...,b p 

Wl,...,W K 



l,...,p 
K 



E 



2tt 



+ 



2tt 



(5.13) 



a i = Yl 

k=l 

Vj = l,...,p 

A" 



Re 



t[l — (p/n) — (p/n)xj(aj + ibj)] — xj 



E 

k=l 



+oo 



Im 



t[l — (p/n) — (p/n)xj(aj + ibj)] — Xj 



*3 } 



dM k (t), 



dAffc(t), 



A 



(5.14) J> fe = l, 

fe=i 

(5.15) Vj = l,...,p bj>0, 

(5.16) = !,...,# w fc >0. 
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Remark 5.1. Since the theory of Sections 4 and 5.1 partly assumes 
that rrij belongs to a compact set in C + bounded away from the real line, 
we might want to add to the real optimization program the constraints that 
— 1/e < o,j < 1/e and that e <bj < 1/e, for some small e > 0. Our simulations 
indicate that for a small value of e such as e = 10 -6 , this makes no difference 
in practice. 

Sequential linear programming. While the optimization program defined 
in equations (5.11)-(5.16) may appear daunting at first sight because of its 
non-convexity, it is, in fact, solved quickly and efficiently by off-the-shelf opti- 
mization software implementing Sequential Linear Programming (SLP). The 
key is to linearize equations (5.12)~(5.13), the two constraints that embody 
the Marcenko-Pastur equation, around an approximate solution point. Once 
they are linearized, the optimization program (5.11)-(5.16) becomes a stan- 
dard Linear Programming (LP) problem, which can be solved very quickly. 
Then we linearize again equations (5.12)-(5.13) around the new point, and 
this generates a new LP problem; hence the name: Sequential Linear Pro- 
gramming. The software iterates until a satisfactory degree of convergence 
is achieved. All of this is handled automatically by the SLP optimizer. The 
user only needs to specify the problem (5.11)-(5.16), as well as some starting 
point, and then launch the SLP optimizer. For our SLP optimizer, we se- 
lected a standard off-the-shelf commercial software: SNOPT™ Version 7.2— 
5; see Gill, Murray and Saunders (2002). While SNOPT™ was originally 
designed for sequential quadratic programming, it also handles SLP, since 
linear programming can be viewed as a particular case of quadratic pro- 
gramming with no quadratic term. 

Starting point. A neutral way to choose the starting point is to place 
equal weights on all the c.d.f.'s in our basis: Wk = 1/K(k = 1, . . . , K). Then it 
is necessary to solve the Marcenko-Pastur equation numerically once before 
launching the SLP optimizer, in order to compute the values of rfiF(xj) 
(j = 1, . . . ,p) that correspond to this initial choice of H = Ylk=i Mk/K- The 
initial values for Oj are taken to be Re[mp(xj)], and Im[m^(xj)] for bj 
(j = 1, . . . ,p). If the choice of equal weights Wk = 1/K for the starting point 
does not lead to convergence of the optimization program within a pre- 
specified limit on the maximum number of iterations, we choose random 
weights Wk generated i.i.d. ~ Uniform [0, 1] (rescaled to sum up to one), 
repeating this process until convergence finally occurs. In the vast majority 
of cases, the optimization program already converges on the first try. For 
example, over 1000 Monte Carlo simulations using the design of Section 6.1 
with p = 100 and n = 300, the optimization program converged on the first 
try 994 times and on the second try the remaining 6 times. 
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Fig. 1. Mean and median CPU times (in seconds) for optimization program as function 
of matrix dimension. The design is the one of Section 6.1 with n = 3p. Every point is the 
result of 1000 Monte Carlo simulations. 

Optimization time. Figure 1 gives some information on how the opti- 
mization time increases with the matrix dimension. 

The main reason for the rate at which the optimization time increases 
with p is that the number of grid points in (5.6) increases linearly in p. 
This linear rate is not a requirement for our asymptotic results. Therefore, 
if necessary, it is possible to pick a less-than-linear rate of increase in the 
number of grid points to speed up the optimization for very large matrices. 

Estimating the covariance matrix. Once the SLP optimizer has con- 
verged, it generates optimal values (a\, . . . , a*), (b\, . . . , b*) and (w;*, . . . , w* K ). 
The first two sets of variables at the optimum are used to estimate the oracle 
shrinkage factors. From the reconstructed m* F (xj) = a* + ib*, we deduce by 
linear interpolation rhp(Xj), for j = 1, . . . ,p. Our estimator of the covariance 

matrix S n is built by keeping the same eigenvectors as the sample covariance 
matrix, and dividing each sample eigenvalue Xj by the following correction 
factor: 



Corollary 5.2 assures us that the resulting bona fide nonlinear shrinkage 
estimator is asymptotically equivalent to the oracle estimator 5° r . Also, 
we can see that, as the concentration c n =p/n gets closer to zero, that 
is, as we get closer to fixed-dimension asymptotics, the magnitude of the 
correction becomes smaller. This makes sense because under fixed-dimension 
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asymptotics the sample covariance matrix is a consistent estimator of the 
population covariance matrix. 

Estimating the precision matrix. The output of the same optimization 
process can also be used to estimate the oracle shrinkage factors for the pre- 
cision matrix. Our estimator of the precision matrix S" 1 is built by keeping 
the same eigenvectors as the sample covariance matrix, and multiplying the 
inverse A" 1 of each sample eigenvalue by the following correction factor: 

1_ £ -2^A 7 Re[mUA 7 )l. 
n n 

Corollary 5.2 assures us that the resulting bona fide nonlinear shrinkage 
estimator is asymptotically equivalent to the oracle estimator P° r . 

Estimating H . We point out that the optimal values (w*, . . . , w* K ) gener- 
ated from the SLP optimizer yield a consistent estimate of H in the following 
fashion: 

K 
k=l 

This estimator could be considered an alternative to the estimator in- 
troduced by El Karoui (2008). The most salient difference between the two 
optimization algorithms is that our objective function tries to match F n 
on R, whereas his objective function tries to match (a function of) mp n 
on C + . The deeper we go into C + , the more "smoothed-out" is the Stielt- 
jes transform, as it is an analytic function; therefore, the more information 
is lost. However, the approach of El Karoui (2008) cannot get too close 
to the real line because mp n starts looking like a sum of Dirac functions 
(which are very ill-behaved) as one gets close to the real line, since F n is 
a step function. In a sense, the approach of El Karoui (2008) is to match 
a smoothed-out version of a sum of ill-behaved Diracs. In this situation, 
knowing "how much to smooth" is rather delicate, and even if it is done 
well, it still loses information. By contrast, we have no information loss be- 
cause we operate directly on the real line, and we have no problems with 
Diracs because we match F n instead of its derivative. The price to pay is 
that our optimization program is not convex, whereas the one of El Karoui 
(2008) is. But extensive simulations reported in the next section show that 
off-the-shelf nonconvex optimization software — as the commercial package 
SNOPT — can handle this particular type of a nonconvex problem in a fast, 
robust and efficient manner. 

It would have been of additional interest to compare our estimator of H 
to the one of El Karoui (2008) in some simulations. But when we tried to 
implement his estimator according to the implementation details provided, 
we were not able to match the results presented in his paper. Furthermore, 
we were not able to obtain his original software. As a result, we cannot make 
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any definite statements concerning the performance of our estimator of H 
compared to the one of El Karoui (2008). 

Remark 5.2 (Cross-validation estimator). The implementation of our 
nonlinear shrinkage estimators is not trivial and also requires the use of 
a third-party SLP optimizer. It is therefore of interest whether an alternative 
version exists that is easier to implement and exhibits (nearly) as good finite- 
sample properties. 

To this end an anonymous referee suggested to estimate the quanti- 
ties d* of (3.2) by a leave-one-out cross-validation method. In particular, let 
(Ai [k] , . . . , X p [k]) ; (ui [k] , . . . , u p [k] ) denote a system of eigenvalues and eigen- 
vectors of the sample covariance matrix computed from all the observed 
data, except for the kth observation. Then d* of (3.2) can be approximated 
by 

n 

dr = -Y,^i[k]'y k )\ 

k=l 

1/2 

where the p x 1 vector y k denotes the kth row of the matrix Y n = X n S n • 
The motivation here is that 

{ui[k]'y k ) 2 = Ui[k]'y k y' k Ui[k], 

where y k is independent of Ui[k] and E,(y k y' k ) = S n (even though y k y' k is of 
rank one only). 

We are grateful for this suggestion, since the cross-validation quanti- 
ties df can be computed without the use of any third-party optimization 
software, and the corresponding computer code is very short. 

On the other hand, the cross-validation estimator has three disadvantages. 
First, when p is large, it takes much longer to compute the cross-validation 
estimator. The reason is that the spectral decomposition of a p x p covariance 
matrix has to be computed n times as opposed to only one time. Second, 
the cross-validation method only applies to the estimation of the covariance 
matrix S n itself. It is not clear how to adapt this method to the (direct) 
estimation of the precision matrix S~ or any other smooth function of X] n . 
Third, the performance of the cross-validation estimator cannot match the 
performance of our method; see Section 6.8. 

Remark 5.3. Another approach proposed recently is the one of Mestre 
and Lagunas (2006). They use so-called "G-estimation," that is, asymptotic 
results that assume the sample size n and the matrix dimension p go to 
infinity together, to derive minimum variance beam formers in the context 
of the spatial filtering of electronic signals. There are several differences 
between their paper and the present one. First, Mestre and Lagunas (2006) 
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are interested in an optimal p x 1 weight vector w op t given by 
w pt = argmin u/£ n io subject to w'sd = 1, 



where is a p x 1 vector containing signal information. Consequently, 
Mestre and Lagunas (2006) are "only" interested in a certain functional 
of S n , while we are interested in the full covariance matrix S n and also in 
the full precision matrix X" 1 . Second, they use the real Stieltjes transform, 
which is different from the more conventional complex Stieltjes transform 
used in random matrix theory and in the present paper. Third, their random 
variables are complex whereas ours are real. The cumulative impact of these 
differences is best exemplified by the estimation of the precision matrix: 
Mestre and Lagunas [(2006), page 76] recommend (1 — p/njS" 1 , which is 
just a rescaling of the inverse of the sample covariance matrix, whereas our 
Section 3.2 points to a highly nonlinear transformation of the eigenvalues of 
the sample covariance matrix. 

6. Monte Carlo simulations. In this section, we present the results of var- 
ious sets of Monte Carlo simulations designed to illustrate the finite-sample 
properties of the nonlinear shrinkage estimator S n . As detailed in Section 3, 
the finite-sample optimal estimator in the class of rotation-equivariant es- 
timators is given by S* as defined in (3.3). Thus, the improvement of the 
shrinkage estimator S n over the sample covariance matrix will be measured 
by how closely this estimator approximates S* relative to the sample covari- 
ance matrix. More specifically, we report the Percentage Relative Improve- 
ment in Average Loss (PRIAL), which is defined as 



where S n is an arbitrary estimator of S n . By definition, the PRIAL of S n 
is 0%, while the PRIAL of S* is 100%. 

Most of the simulations will be designed around a population covariance 
matrix S n that has 20% of its eigenvalues equal to 1, 40% equal to 3 and 
40% equal to 10. This is a particularly interesting and difficult example 
introduced and analyzed in detail by Bai and Silverstein (1998). For concen- 
tration values such as c = 1/3 and below, it displays "spectral separation;" 
that is, the support of the distribution of sample eigenvalues is the union 
of three disjoint intervals, each one corresponding to a Dirac of population 
eigenvalues. Detecting this pattern and handling it correctly is a real chal- 
lenge for any covariance matrix estimation method. 

6.1. Convergence. The first set^of Monte Carlo simulations shows how 
the nonlinear shrinkage estimator S n behaves as the matrix dimension p and 
the sample size n go to infinity together. We assume that the concentration 
ratio c n =p/n remains constant and equal to 1/3. For every value of p (and 
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Fig. 2. Comparison of the nonlinear vs. linear shrinkage estimators. 20% of population 
eigenvalues are equal to 1, 40% are equal to 3 and 40% are equal to 10. Every point is the 
result of 1000 Monte Carlo simulations. 



hence n), we run 1000 simulations with normally distributed variables. The 
PRIAL is plotted in Figure 2. For the sake of comparison, we also report 
the PRIALs of the oracle 5° r and the optimal linear shrinkage estimator S n 
developed by Ledoit and Wolf (2004). 

One can see that the performance of the nonlinear shrinkage estimator S n 
converges quickly toward that of the oracle and of S*. Even for relatively 
small matrices of dimension p = 30, it realizes 88% of the possible gains 
over the sample covariance matrix. The optimal linear shrinkage estima- 
tor S n also performs well relative to the sample covariance matrix, but the 
improvement is limited: in general, it does not converge to 100% under large- 
dimensional asymptotics. This is because there are strong nonlinear effects in 
the optimal shrinkage of sample eigenvalues. These effects are clearly visible 
in Figure 3, which plots a typical simulation result ior j) = 100. 

One can see that the nonlinear shrinkage estimator S n shrinks the eigen- 
values of the sample covariance matrix almost as if it "knew" the correct 
shape of the distribution of population eigenvalues. In particular, the various 
curves and gaps of the oracle nonlinear shrinkage formula are well picked up 
and followed by this estimator. By contrast, the linear shrinkage estimator 
can only use the best linear approximation to this highly nonlinear transfor- 
mation. We also plot the 45-degrees line as a visual reference to show what 
would happen if no shrinkage was applied to the sample eigenvalues, that 
is, if we simply used S n . 

6.2. Concentration. The next set of Monte Carlo simulations shows how 
the PRIAL of the shrinkage estimators varies as a function of the concen- 
tration ratio c n =p/n if we keep the product px n constant and equal to 
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Fig. 3. Nonlinearity of the oracle shrinkage formula. 20% of population eigenvalues are 
equal to 1, 40% are equal to 3 and 40% are equal to 10. p = 100 and n — 300. 



9000. We keep the same population covariance matrix T, n as in Section 6.1. 
For every value of p/n, we run 1000 simulations with normally distributed 
variables. The respective PRIALs of 5° r , S n and S n are plotted in Figure 4. 

One can see that the nonlinear shrinkage estimator performs well across 
the board, closely in line with the oracle, and always achieves at least 90% 
of the possible improvement over the sample covariance matrix. By contrast, 
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Fig. 4. Effect of varying the concentration ratio c n =p/n. 20% of population eigenvalues 
are equal to 1, 40% are equal to 3 and 40% are equal to 10. Every point is the result of 
1000 Monte Carlo simulations. 
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the linear shrinkage estimator achieves relatively little improvement over the 
sample covariance matrix when the concentration is low. This is because, 
when the sample size is large relative to the matrix dimension, there is a lot 
of precise information about the optimal nonlinear way to shrink the sample 
eigenvalues that is waiting to be extracted by a suitable nonlinear procedure. 
By contrast, when the sample size is not so large, the information about the 
population covariance matrix is relatively fuzzy; therefore a simple linear 
approximation can achieve up to 93% of the potential gains. 

6.3. Dispersion. The third set of Monte Carlo simulations shows how 
the PRIAL of the shrinkage estimators varies as a function of the dispersion 
of population eigenvalues. We take a population covariance matrix S n with 
20% of its eigenvalues equal to 1, 40% equal to 1 + 2d/9 and 40% equal 
to 1 + d, where the dispersion parameter d varies from to 20. Thus, for 
d = 0, S n is the identity matrix and, for d = 9, S n is the same matrix as in 
Section 6.1. The sample size is n = 300 and the matrix dimension is p = 100. 
For every value of d, we run 1000 simulations with normally distributed 
variables. The respective PRIALs of S° r , S n and S n are plotted in Figure 5. 

One can see that the linear shrinkage estimator S n beats the nonlinear 
shrinkage estimator S n for very low dispersion levels. For example, when 
d = 0, that is, when the population covariance matrix is equal to the iden- 
tity matrix, S n realizes 99.9% of the possible improvement over the sample 
covariance matrix, while S n realizes "only" 99.4% of the possible improve- 
ment. This is because, in this case, linear shrinkage is optimal or (when d 
is strictly positive but still small) nearly optimal Hence there is nothing too 
little to be gained by resorting to a nonlinear shrinkage method. However, as 
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Fig. 5. Effect of varying the dispersion of population eigenvalues. 20% of population 
eigenvalues are equal to 1, 40% equal to 1 + 2d/9 and 40% equal to 1 + d, where the 
dispersion parameter d varies from to 20. p = 100 and n — 300. Every point is the result 
of 1000 Monte Carlo simulations. 
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Table 1 

Effect of nonnormality. 20% of population eigenvalues are equal to 1, 40% are equal to 3 
and 40% are equal to 10. 1000 Monte Carlo simulations with p = 100 and n = 300 





Average squared 
Frobenius loss 




PRIAL 


df = 3 


df = oo 


df = 3 


df = oo 


Sample covariance matrix 


5.856 


5.837 


0% 


0% 


Linear shrinkage estimator 


1.883 


1.883 


67.84% 


67.74% 


Nonlinear shrinkage estimator 


0.128 


0.133 


97.81% 


97.71% 


Oracle 


0.043 


0.041 


99.27% 


99.30% 



dispersion increases, linear shrinkage delivers less and less improvement over 
the sample covariance matrix, while nonlinear shrinkage retains a PRIAL 
above 96%, and close to that of the oracle. 

6.4. Fat tails. We also have some results on the effect of non-normality 
on the performance of the shrinkage estimators. We take the same popula- 
tion covariance matrix as in Section 6.1, that is, S n has 20% of its eigen- 
values equal to 1, 40% equal to 3 and 40% equal to 10. The sample size 
is n = 300, and the matrix dimension is p = 100. We compare two types of 
random variates: a Student t distribution with df = 3 degrees of freedom, 
and a Student t distribution with df = oo degrees of freedom (which is the 
Gaussian distribution). For each number of degrees of freedom df, we run 
1000 simulations. The respective PRIALs of 5° r , S n and S n are summarized 
in Table 1. 

One can see that departure from normality does not have any noticeable 
effect on performance. 

6.5. Precision matrix. The next set of Monte Carlo simulations focuses 
on estimating the precision matrix S~ . The definition of the PRIAL, in 
this subsection only, is given by 

~ f FJIITT — P*II 2 1 1 
(6.2) PRIAL = PRIAL(II n ) = 100 x \ 1 V ±JL rkl± \% 

I ElWSn 1 ~ P*\\ 2 } ) 

where LI n is an arbitrary estimator of S" 1 . By definition, the PRIAL of S"" 1 
is 0% while the PRIAL of P* is 100%. 

We take the same population eigenvalues as in Section 6.1. The concentra- 
tion ratio c n = p/n is set to the value 1/3. For various values of p between 
30 and 200, we run 1000 simulations with normally distributed variables. 
The respective PRIALs of P° r , P n , 5" 1 and S n are plotted in Figure 6. 

One can see that the nonlinear shrinkage method seems to be just as 
effective for the purpose of estimating the precision matrix as it is for the 
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Fig. 6. Estimating the precision matrix. 20% of population eigenvalues are equal to 1, 
40% are equal to 3 and 40% are equal to 10. Every point is the result of 1000 Monte Carlo 
simulations. 



purpose of estimating the covariance matrix itself. Moreover, there is a clear 
benefit in directly estimating the precision matrix by means of P n as opposed 
to the indirect estimation by means of S~ l (which on its own significantly 

outperforms S n ). 

6.6. Shape. Next, we study how the nonlinear shrinkage estimator S n 
performs for a wide variety of shapes of population spectral densities. This 
requires using a family of distributions with bounded support and which, 
for various parameter values, can take on different shapes. The best-suited 
family for this purpose is the beta distribution. The c.d.f. of the beta distri- 
bution with parameters (a,/3) is 

While the support of the beta distribution is [0, 1] , we shift it to the interval 
[1, 10] by applying a linear transformation. Thanks to the flexibility of the 
beta family of densities, selecting different parameters (a,/3) enables us to 
generate eight different shapes for the population spectral density: rectan- 
gular (1,1), linearly decreasing triangle (1,2), linearly increasing triangle 

(2.1) , circular (1.5,1.5), U-shaped (0.5,0.5), bell-shaped (5,5), left-skewed 

(5.2) and right-skewed (2,5); see Figure 7 for a graphical illustration. 

For every one of these eight beta densities, we take the population eigen- 
values to be equal to 
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Shape of the Beta Density for Various Parameters 




1 23456789 10 
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Fig. 7. Shape of the beta density for various parameter values. The support of the beta 
density has been shifted to the interval [1, 10] by a linear transformation. To enhance clar- 
ity, the densities corresponding to the parameters (2, 1) and (5, 2) have been omitted, since 
they are symmetric to (1,2) and (2,5), respectively, about the mid-point of the support. 

The concentration ratio c n =p/n is equal to 1/3. For various values of p 
between 30 and 200, we run 1000 simulations with normally distributed 
variables. The PRIAL of the nonlinear shrinkage estimator S n is plotted in 
Figure 8. 

As in all the other simulations presented above, the PRIAL of the non- 
linear shrinkage estimator always exceeds 88%, and more often than not 
exceeds 95%. To preserve the clarity of the picture, we do not report the 
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Fig. 8. Performance of the nonlinear shrinkage with beta densities. The various curves 
correspond to different shapes of the population spectral density. The support of the popu- 
lation spectral density is [1,10]. 
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PRIALs of the oracle and of the linear shrinkage estimator; but as usual, 
the nonlinear shrinkage estimator ranked between them. 

6.7. Fixed- dimension asymptotics. Finally, we report a set of Monte Carlo 
simulations that departs from the large-dimensional asymptotics assumption 
under which the nonlinear shrinkage estimator S n was derived. The goal is to 
compare it against the sample covariance matrix S n in the setting where S n 
is known to have certain optimality properties (at least in the normal case): 
traditional asymptotics, that is, when the number of variables p remains 
fixed while the sample size n goes to infinity. This gives as much advantage 
to the sample covariance matrix as it can possibly have. We fix the dimen- 
sion p = 100 and let the sample size n vary from n = 125 to n = 10,000. In 
practice, very few applied researchers are fortunate enough to have as many 
as n = 10,000 i.i.d. observations, or a concentration ratio c = p/n as low as 
0.01. The respective PRIALs of S° r , S n and S n are plotted in Figure 9. 

One crucial difference with all the previous simulations is that the tar- 
get for the PRIAL is no longer S 1 *, but instead the population covariance 
matrix £ itself, because now £ can be consistently estimated. Note that, 
since the matrix dimension is fixed, S n does not change with n; therefore, 
we can drop the subscript n. Thus, in this subsection only, the definition of 
the PRIAL is given by 



where S n is an arbitrary estimator of S. By definition, the PRIAL of S n is 
0% while the PRIAL of S is 100%. 



PRIAL = PRIAL(S n ) = 100 x \ 1 



E[||S n -£|| 2 ] 

E[l|Sn-£|| 2 ] 
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Fig. 9. Fixed-dimension asymptotics. 20% of population eigenvalues are equal to 1, 40% 
are equal to 3 and 40% are equal to 10. Variables are normally distributed. Every point is 
the result of 1000 Monte Carlo simulations. 
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In this setting, Ledoit and Wolf (2004) acknowledge that the improve- 
ment of the linear shrinkage estimator over the sample covariance matrix 
vanishes asymptotically, because the optimal linear shrinkage intensity van- 
ishes. Therefore it should be no surprise that the PRIAL of S n goes to 
zero in Figure 9. Perhaps more surprising is the continued ability of the 
oracle and the nonlinear shrinkage estimator to improve by approximately 
60% over the sample covariance matrix, even for a sample size as large as 
n = 10,000, and with no sign of abating as n goes to infinity. This is an 
encouraging result, as our simulation gave every possible advantage to the 
sample covariance matrix by placing it in the asymptotic conditions where 
it possesses well-known optimality properties, and where the earlier linear 
shrinkage estimator of Ledoit and Wolf (2004) is most disadvantaged. 

Intuitively, this is because the oracle shrinkage formula becomes more and 
more nonlinear as n goes to infinity for fixed p. Bai and Silverstein (1998) 
show that the sample covariance matrix exhibits "spectral separation" when 
the concentration ratio p/n is sufficiently small. It means that the sample 
eigenvalues coalesce into clusters, each cluster corresponding to a Dirac of 
population eigenvalues. Within a given cluster, the smallest sample eigen- 
values need to be nudged upward, and the largest ones downward, to the 
average of the cluster. In other words: full shrinkage within clusters, and 
no shrinkage between clusters. This is illustrated in Figure 10, which plots 
a typical simulation result for n = 10, 000. 2 

By detecting this intricate pattern automatically, that is, by discovering 
where to shrink and where not to shrink, the nonlinear shrinkage estima- 
tor S n showcases its ability to generate substantial improvements over the 
sample covariance matrix even for very low concentration ratios. 

6.8. Additional Monte Carlo simulations. 

6.8.1. Comparisons with other estimators. So far, we have compared the 
nonlinear shrinkage estimator S n only to the linear shrinkage estimator S n 
and the oracle estimator S° r to keep the resulting figures concise and legible. 

It is of additional interest to compare the nonlinear shrinkage estimator 
also to some other estimators from the literature. To this end we consider 
the following set of estimators: 

• The estimator of Stein (1975); 

• The estimator of Haff (1980); 

• The estimator recently proposed by Won et al. (2009). This estimator is 
based on a maximum likelihood approach, assuming normality, with an 
explicit constraint on the condition number of the covariance matrix. The 



2 For enhanced ability to distinguish linear shrinkage from the sample covariance matrix, 
we plot the two uninterrupted lines, even though the sample eigenvalues lie in three disjoint 
intervals (as can be seen from nonlinear shrinkage). 
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Fig. 10. Nonlinear shrinkage under fixed-dimension aymptotics. 20% of population 
eigenvalues are equal to 1, 40% are equal to 3 and 40% are equal to 10. p = 100 ana! 
n = 10,000 . 27ie oracle is not shown because it is virtually identical to the nonlinear shrink- 
age estimator. 

resulting estimator turns out to be a nonlinear shrinkage estimator as 
well: all "small" sample eigenvalues are brought up to a lower bound, all 
"large" sample eigenvalues are brought down to an upper bound, and all 
"intermediate" sample eigenvalues are left unchanged. 

Therefore, the corresponding transformation from sample eigenvalues 
to shrunk eigenvalues is step-wise linear: first flat, then a 45-degree line, 
and then flat again. The upper and lower bounds are determined by the 
desired constraint on the condition number k. If such an explicit constraint 
is not available from a priori information, a suitable constraint number k 
can be computed in a data-dependent fashion by a K-fold cross-validation 
method, which is the method we use. 3 

In particular, the cross-validation method selects k by optimizing over 
a finite grid {ki, k<z, . . . ,kl} that has to be supplied by the user. To 
this end we choose L = 10 and the m log-linearly spaced between 1 and 
n(S n ), for I = 1, . . . , L; here n(S n ) denotes the condition number of the 
sample covariance matrix. More precisely, for I = 1, . . . ,L, m = exp(uji), 
where {wi ,u>2, is the equally-spaced grid with oj\ = and ujl = 

l0g(*(Sn)). 

• The cross-validation version of the nonlinear shrinkage estimator S n ; see 
Remark 5.2. 



3 We are grateful to Joong-Ho Won for supplying us with corresponding Matlab code. 
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Fig. 11. Comparison of various estimators. 20% of population eigenvalues are equal to 1, 
40% are equal to 3 and 40% are equal to 10. Every point is the result of 1000 Monte Carlo 
simulations. 



We repeat the simulation exercises of Sections 6.1-6.3, replacing the oracle 
estimator and the linear shrinkage estimator with the above set of other 
estimators. The respective PRIALs of the various estimators are plotted in 
Figures 11-13. 

One can see that the nonlinear shrinkage estimator S n outperforms all 
other estimators, with the cross-validation version of S n in second place, 
followed by the estimators of Stein (1975), Won et al. (2009) and Haff 
(1980). 
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Fig. 12. Effect of varying the concentration ratio c n =p/n. 20% of population eigenval- 
ues are equal to 1, 40% are equal to 3 and 40% are equal to 10. Every point is the result 
of 1000 Monte Carlo simulations. 
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Fig. 13. Effect of varying the dispersion of population eigenvalues. 20% of population 
eigenvalues are equal to 1, 40% equal to 1 + 2d/9 and 40% equal to 1 + d. where the 
dispersion parameter d varies from to 20. p = 100 and n — 300. Every point is the result 
of 1000 Monte Carlo simulations. 



6.8.2. Comparisons based on a different loss function. So far, the PRIAL 
has been based on the loss function 

L r (S n , S n ) = ||S n — Sn|| 2 . 

It is of additional interest to add some comparisons based on a different loss 
function. To this end we use the scale-invariant loss function proposed by 
James and Stein (1961), namely 

(6.3) L J5 (£ n ,£ n ) = trace^S" 1 ) - log det^S^ 1 ) -p. 

We repeat the simulation exercises of Sections 6.1-6.3, replacing L Fr 
with L JS . The respective PRIALs of , S n , and S n are plotted in Fig- 
ures 14-16. 

One can see that the results do not change much qualitatively. If anything, 
the comparisons are now even more favorable to the nonlinear shrinkage 
estimator, in particular when comparing Figure 5 to Figure 16. 

7. Conclusion. Estimating a large-dimensional covariance matrix is a very 
important and challenging problem. In the absence of additional information 
concerning the structure of the true covariance matrix, a successful approach 
consists of appropriately shrinking the sample eigenvalues, while retaining 
the sample eigenvectors. In particular, such shrinkage estimators enjoy the 
desirable property of being rotation-equivariant. 

In this paper, we have extended the linear approach of Ledoit and Wolf 
(2004) by applying a nonlinear transformation to the sample eigenvalues. 
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Fig. 14. Comparison of the nonlinear vs. linear shrinkage estimators. 20% of population 
eigenvalues are equal to 1, 40% are equal to 3 and 40% are equal to 10. The PRIALs 
are based on the James-Stein (1961) loss function (6.3). Every point is the result of 1000 
Monte Carlo simulations. 



The specific transformation suggested is motivated by the oracle estima- 
tor of Ledoit and Peche (2011), which in turn was derived by studying 
the asymptotic behavior of the finite-sample optimal rotation-equivariant 
estimator (i.e., the estimator with the rotation-equivariant property that 
is closest to the true covariance matrix when distance is measured by the 
Frobenius norm). 
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Fig. 15. Effect of varying the concentration ratio ~c n — p/n. 20% of population eigenval- 
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Fig. 16. Effect of varying the dispersion of population eigenvalues. 20% of population 
eigenvalues are equal to 1, 40% equal to 1 + 2d/9 and 40% equal to 1 + d. where the 
dispersion parameter d varies from to 20. p = 100 and n = 300. The PRIALs are based 
on the James and Stein (1961 ) loss function (6.3). Every point is the result of 1000 Monte 
Carlo simulations. 



The oracle estimator involves the Stieltjes transform of the limiting spec- 
tral distribution of the sample eigenvalues, evaluated at various points on 
the real line. By finding a way to consistently estimate these quantities, 
in a uniform sense, we have been able to construct a bona fide nonlinear 
shrinkage estimator that is asymptotically equivalent to the oracle. 

Extensive Monte Carlo studies have demonstrated the improved finite- 
sample properties of our nonlinear shrinkage estimator compared to the 
sample covariance matrix and the linear shrinkage estimator of Ledoit and 
Wolf (2004), as well as its fast convergence to the performance of the oracle. 
In particular, when the sample size is very large compared to the dimension, 
or the population eigenvalues are very dispersed, the nonlinear shrinkage 
estimator still yields a significant improvement over the sample covariance 
matrix, while the linear shrinkage estimator no longer does. 

Many statistical applications require an estimator of the inverse of the 
covariance matrix, which is called the precision matrix. We have modified 
our nonlinear shrinkage approach to this alternative problem, thereby con- 
structing a direct estimator of the precision matrix. Monte Carlo studies 
have confirmed that this estimator yields a sizable improvement over the 
indirect method of simply inverting the nonlinear shrinkage estimator of the 
covariance matrix itself. 

The scope of this paper is limited to the case where the matrix dimension 
is smaller than the sample size. The other case, where the matrix dimension 
exceeds the sample size, requires certain modifications in the mathematical 
treatment, and is left for future research. 
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